Performance tests
ROUGE
ROUGE is a metric for evaluating summary quality against a reference text by measuring the token-level overlap between the text a model generates and the reference. DynamoFL supports three ROUGE variants: ROUGE-1, ROUGE-2, and ROUGE-L.
ROUGE-1: measures the overlap of unigrams (i.e., single tokens) between the model-generated outputs and the reference texts
ROUGE-2: measures the overlap of bigrams (i.e., pairs of adjacent tokens) between the model-generated outputs and the reference texts
ROUGE-L: measures the longest common subsequence between the generated text and the reference, rewarding longer in-order matches; see the sketch after this list
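As a minimal illustration of the metric itself (not the DynamoFL API), the open-source rouge-score package computes all three variants; the example texts below are hypothetical:

```python
# pip install rouge-score
from rouge_score import rouge_scorer

reference = "The cat sat on the mat near the door."
candidate = "The cat sat by the door on the mat."

# use_stemmer=True normalizes word endings before matching
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)

for name, result in scores.items():
    # Each entry carries precision, recall, and F-measure
    print(f"{name}: P={result.precision:.3f} R={result.recall:.3f} F1={result.fmeasure:.3f}")
```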
BERTScore
BERTScore computes semantic similarity between a reference text and a model response using contextual embeddings from a pre-trained BERT model, and it reports a precision, recall, and F1 score.
Precision: for each token in the generated output, takes the maximum cosine similarity with any token in the reference text, then averages over the generated tokens
Recall: for each token in the reference text, takes the maximum cosine similarity with any token in the generated output, then averages over the reference tokens
F1 Score: the harmonic mean of precision and recall, 2PR / (P + R), which is high only when both are high; see the sketch after this list
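As a minimal sketch (again standalone, not the DynamoFL API), the open-source bert_score package returns these three values directly; the example texts are hypothetical:

```python
# pip install bert-score
from bert_score import score

candidates = ["The cat sat by the door on the mat."]
references = ["The cat sat on the mat near the door."]

# lang="en" selects a default English BERT-family model;
# each return value is a tensor with one entry per candidate
P, R, F1 = score(candidates, references, lang="en")
print(f"P={P[0].item():.3f} R={R[0].item():.3f} F1={F1[0].item():.3f}")
```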